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Abstract 

In the mixture modeling frame, this paper presents the polynomial Gaussian cluster-weighted 
model (CWM). It extends the linear Gaussian CWM, for bivariate data, in a twofold way. Firstly, it 
allows for possible nonlinear dependencies in the mixture components by considering a polynomial re- 
gression. Secondly, it is not restricted to be used for model-based clustering only being contextualized 
in the most general model-based classification framework. Maximum likelihood parameter estimates 
are derived using the EM algorithm and model selection is carried out using the Bayesian infor- 
mation criterion (BIC) and the integrated completed likelihood (ICL). The paper also investigates 
the conditions under which the posterior probabilities of component-membership from a polynomial 
Gaussian CWM coincide with those of other well-established mixture-models which are related to it. 
With respect to these models, the polynomial Gaussian CWM has shown to give excellent clustering 
and classification results when applied to the artificial and real data considered in the paper. 

Key words: Mixture of distributions, Mixture of regressions, Polynomial regression, Model-based 
clustering, Model-based classification, Cluster-weighted models. 

1 Introduction 



Finite mixture models a re commonly employed in statistical modeling with two different purposes 



(jTitterington et al 



19851 pp. 2-3). In indirect appl ications, they are used as semipara metric competitors 



of no nparamctri c density estimation techn iques (see 



Titterington et al 



2000 . p. 8 and 



Escobar and West 



19851 pp. 28-29, 



McLachlan and Peel 



1995I ). On the other hand, in direct applications, finite mixture 
models are considered as a powerful device for clustering and classificatio n by assuming that each 



mix ture-component represents a group (or cluster) in the original data (see 



and 



McLachlan and Basforc 



Fralev and Rafterv 



1998 



19881 ). The areas of application of mixture models range from biology and 
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medicine (see 



Schlattmann 



20091) to e conomics and market i ng (s ee 



Wedel and Kamakura, 



overview on mixture models is given in 



McLachlan and Peell ( 2000 ) and 



Friihwirth- Schnatter 



2001). An 



200G) 



This paper focuses both on direct and indirect applications. The context of interest is represented by 
data arising from a real- valued bivariatc random vector (X, Y) in which a functional dependence of Y on 
x is assumed for each mixture-component. Note t hat, hereafter, all v ectors are considered to be column 
vectors. The linear cluster-weighted model (CWM; 



3 



Gershc nfcld 1997) constitutes a natural choice, in the 



mixtures frame, when this functional relationship is supposed to be linear. The (linear) CWM factorizes 
the joint density of (X, Y) , in each mixture-component, into the product of the conditional density of 
Y\x and the marginal density of X. A Gaussian distribu tion is usually used for both of them leading to 
the (linear) Gaussian CWM (see 



Ingrassia et al 



2a.l 



2012allb for details on this model and for an extension 



to the t distribution). Generally, regardless from the de nsity shape, CWMs take s imultaneously into 



account the potential of finite mixtu res of regressions (s ee 



finite mixtures of distributions (see, 



Titterington et al 



Friihw: 



1985 and 



rth- Schnatter 



20061 Chapter 8) and of 



McLachlan and Peel 



20001 ): the idea of 



the former approach is adopted to model the conditional density of Y\x, while the principle of the latter 
is used to model both the joint density of (X, Y) and the marginal density of X. Unfortunately, if the 
component-dependence structure of Y on x is different from the linear one, the linear CWM is not able 
to capture it. 

To solve this problem, the present paper illustrates the polynomial Gaussian CWM. It generalizes 
the linear Gaussian CWM by using a polynomial model for the dependence of Y on x in each mixture- 
component. Regarding the comparison between a polynomial Gaussian CWM and a finite mixture of 
polynomial Gaussian regressions, two related aspects need to be preliminarily highlighted: from an in- 
direct point of view, in fact, they are not comparable since the latter is not conceived to model the joint 
density of (X, Y) , but only the conditional density of Y\x, and this difference, from a direct point of 
view, implicitly affects the criterion (vertical distance for finite mixture of regressions) used for cluster- 
ing and classification. With respect to the latter point, in the paper we will show both artificial and 
real data on which the polynomial Gaussian CWM outperforms finite mixtures of polynomial Gaussian 
regressions. Also, most of the existing literature and applications for bivariate mixture models are based 
on Gaussian components. Mi xtures of t distributions hav e been considered as robust alternatives (see 



Peel and McLachlan 2000 and 



mode l-based clustering 



Celeux and Govaert 



19981 ) and model-based classification 



1995 



Dean et al 



Greselin and IngrassklLoiOD 



Banfield and Rafterv 



2006 and 



Andrews et al 



1993, and 



McLachlan and Peel 



20111 ). However, Gaussian 



and t models imply that the groups are elliptically contoured, which is rather restrictive. Further- 



more, it is well known that non-elliptical subpo pulations can be ap 



of several basic densities like the Gaussian one ( Fralev and Rafterv 



jroximated quite well by a mixture 



1998 



Dasgupta and Rafterv 



1998 
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McLachlan and Peel 



2000 p. 1 and 



Titterington et al 



198a p. 24). While this can be very helpful for 



modeling purposes, it can be misleading when dealing with clustering applications since one group may 
be represented by more than one component just because it has, in fact, a non-elliptical density. A 
common approach to treat non-elliptical subpopulations consists in considering transformations so as to 
make the components as e lliptical as pos sible and then fitting symmetric (usually Gaussian) mixtures 



(Gutierrez et 



1995 and 



Lo et al 



20081 ). Although such a treatment is very convenient to use, the 
achievement of joint ellipticality is rarely satisfied and the transformed variables become more difficult 
to interpret. Instead of applying transformations, there has been a growing interest in proposing finite 
mixture models where the component densities are non-ell iptical (above all in terms o f sk ewness) so 



(Karlis and Santourian 


2009 


and 


Lin 


2009 



2010) . While such models can be used to create finite mixture models and provide alternative shapes 
for the derived clusters, they have certain computational difficulties. On the contrary the polynomial 
Gaussian CWM, which allows the bivariate density components to be flexible enough by allowing the 
polynomial degree to increase, is easily applicable, has much simpler expressions for estimation purposes, 
and generates easily interpreted clusters. 

The paper is organized as follows. In Section [5] the polynomial Gaussian CWM is presented and, in 
Section |31 its use is contextualized in the presence of possible labeled observations among the available 
ones. In Sectional maximum likelihood estimation of the model parameters is approached by considering, 
and detailing, the EM-algorithm. With this regard, computational details are given in Section [5] while 
methods to select the number of components and the polynomial degree are given in Section [B] Some 
theoretical notes about the relation, from a direct point of view, with the classification provided by 
other models in the mixture frame, are given in Section 17.11 Artificial and real data are considered in 
Section 17.21 and Section 17.31 and the paper closes, with discussion and suggestions for further work, in 
Section El 



2 Model Definition 

Let 



k 

p{x,y;ip) = ^TTjf (x,y;^j) (1) 



be the finite mixture of distributions, with k components, used to estimate the joint density of (A, Y) . 
In (HJ), / (•; -dj) is the parametric (with respect to the vector #j) density associated to the jth component, 
TTj is the weight of the jth component, with ttj > and $3j=i w j = 1> ant ^ V* = ( 7r '>'$ / ) ' W1 th tv = 
(711, . . . , 7Tfe_i) and i9 = (i?i, . . . , $&)', contains all the unknown parameters in the mixture. As usual, 



3 



model ([!} implicitly assumes that the component densities should all belong to the same parametric 
family. 

Now, suppose that for each j the functional dependence of Y on x can be modeled as 



Y = fij (x)+Sj, 



(2) 



where fij (x) = E (Y\X = x, j) is the regression function and Ej is the error variable having a Gaussian dis- 
tribution with zero mean and a finite constant variance u\. , hereafter simply denoted by Ej ~ N ^0, (T^J . 
Thus, for each j, Y\x ~ N (^fij (x) , cr^ J ■ In the parametric paradigm, the polynomial regression function 



Hj (x) = \i r (x; (3j) = ^2 Pij xl = P'j x 



(3) 



1=0 



represents a very flexible way to model the functional dependence in each component. In /3 ■ = 
(f3oj, . . . , (3 r j)' is the (r + l)-dimensional vector of real parameters, x is the (r + l)-dimcnsional 
Vandcrmonde vector associated to x, while r is an integer representing the polynomial order (degree) 
which is assumed to be fixed with respect to j. 

The mixture model ([T]) becomes a polynomial Gaussian CWM when the jth component joint density 
is factorized as 

/ (x, y, &j) =<$> (y x; fi r (x; fy) , J (/> (x; fi X \j,Ox\j) ■ ( 4 ) 

In Q, (/)(■) denotes a Gaussian density; this means that, for each j, X ~ N (ux\ji a x\jj • Summarizing, 
the polynomial Gaussian CWM has equation 



K 

P (x, y; ip) = ^2 7T 3 ( I ) (v x 'i (x; f3j) ,a^^j<t> (x; Hx\j,a 



3=1 



ho. 



(5) 



Note that the number of free parameters in is rj = kr + Ak — 1. M oreover, if r = 1 in equa tion ([3]), 



model (|5|) corresponds to the linear Gaussian CWM widely analyzed in 



Ingrassia et al 



(|2012ah . 



3 Modeling framework 

As said in Section [TJ the polynomial Gaussian CWM, being a mixture model, can be also used for direct 
applications, where the aim is to clusterize/classify observations which have unknown component mem- 
berships (the so-called unlabeled observations). To embrace both clustering and classification purposes, 
we have chosen a very general scenario where there are n observations (xi, y\) , . . . , (x n , y n ) , m of which 
are labeled. As a special case, if m = 0, we obtain the clustering scenario. Within the model-based 



4 



classification framework, we use all the n observations to estimate the parameters in (|5|) ; the fitted mix- 
ture model is so adopted to classify each of the n — m unlabeled observat i ons through the correspon ding 



Hosmer Jr 



( 1973 ) 



Titterington et al 



(1985, Sec 



maximum a posteriori probability (MAP). Drawing on 
tion 4.3.3) pointed out that knowing the label of just a small proportion of observations a priori can 
lead to improved clustering performance. 

Notationally, let Zi be the fc-dimensional component-label vector in which the jth element z%j is 
defined to be one or zero according to whether the mixture-component of origin of is equal 

to j or not, j = 1, . . . , fc. If the ith observation is labeled, denote with Zj = {z%i, . ■ ■ ,Zik) its com- 
ponent membership indicator. Then, arranging the data so that the first m observations are labeled, 
the observed sample can be denoted by S = {Si,S u }, where Si = yi, 2^) (x m ,y rn ,z' m ) j 

is the sample of labeled observations while S u = {(xm+iiVm+i) , . . . , (x n ,y n )'} is the sample of un- 
labeled observations. The completed-data sample can be so indicated by <S C = {Si, S*}, where S* = 
| (a; TO+ i, y m+1 , z' m+1 )' , (x n , y n , z' n )'\. Hence, the observed-data log-likclihood for the polynomial 
Gaussian CWM, when both k and r are supposed to be pre-assigned, can be written as 



m k 



I (VO = EE^' [tonj +1nf(xi,yi;-&j)] + ^ In 

6=1 j — 1 i — m+1 



53 7r,-/ (xi,yi;&j 

3=1 



(6) 



while the complete-data log-likclihood is 



m k n k 

= EE 2 ^ 1 [ ln7r J ■ +\nf (xi,yi;&j)] + ^2 E Zi i [ ln7r ? + ln f ( Xi > yi ' ^ 

i—1 j—1 i— m+1 J=l 

= he (IT) + he («) + he (£) , 



(7) 



where k = (k[, k'J with k 3 = (ux\j,ojc\^j , £ = • • • , i'k)' with £j = {■[,■"' ) > 



and 



he (tt) 
he (k) 



*8c (I) 



m k n k 

?—l j — 1 z— m+1 j — 1 



m k f af \ A 



i=l j = l 



^ n k 

2 E E^- 

i=m+l j — 1 



ln(2?r) - ln (a 



III n, 

2EE% 

i=l o = l 

n fc 

+2 E 5>« 

z=m+l j — 1 



In (2tt) - In <4 



■ln(27r)-ln(^ b .) - 



(xi-nx\jY 



2 1 



'X\j 



(8) 



(9) 



(10) 
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4 The EM algorithm for maximum likelihood estimation 



The EM algorithm (jDempster et al 



19771) can be used to maximize / (if?) in order to find maximum 
likelihood (ML) estimates for the unknown parameters of the polynomial Gaussian CWM. When both 
labeled and unlabeled data are used, the E and M steps of the algorithm can be detailed as follows. 



4.1 E-step 

The E-step, on the (q + l)th iteration, requires the calculation of 



Q(V;t/> (9) ) =E^ [l c W)\s: 



(11) 



As l c (i/>) is linear in the unobservable data 2y, the E-step - on the {q + l)th iteration - simply requires 
the calculation of the current conditional expectation of given the observed sample, where Z^j is the 
random variable corresponding to Zij. In particular, for i = m + 1, . . . ,n and j = 1, . . . , k, it follows that 



(9) 



(12) 



which corresponds to the posterior probability that the unlabeled observation (a;,-, j/j) belongs to the jth 
component of the mixture, using the current fit xp^ for rj>. By substituting the values z^ in ([7]) with 
the values z^f obtained in (fT2")l . we have 



Q (V; = Qi (tt; </> (9) ) + Q 2 (k; + Q 3 (C; V> (9) ) 



(13) 



G 



where 



i=i j=i 

m k 



Q 2 



(14) 



»=i 3=1 



-ln(27r) - In (<7^ 



(</) 



' 2 E E ~ l 

i— m+1 j — 1 



In 



(270 -ln (4) 



(Vi-PjX 



of 



(15) 



(g) 



EE : 

i=i j=i 



" 2 E E ~« 

i— m+1 j — 1 



hi 



(g) 



(2tt) - In [a x{j ) - 2 

a x\ 3 

In (2tt) - In ^ x]j .J — 2 



(16) 



4.2 M-step 

On the M-step, at the (q + l)th iteration, it follows from JI3J that tt^ 1 ', and | (<?+1) can be 

computed independently of each other, by separate maximization of (|14j) . (|15[l and (|16[) . respectively. 
Here, it is important to note that the solutions exist in closed form. 

Regarding the mixture weights, maximization of Q\ (^ir;xp( q ^ with respect to ir, subject to the 
constraints on those parameters, is obtained by maximizing the augmented function 



m k 



n k 



EE^ ln ^,+ E E4- in^-A E^- 1 > 



(17) 



i=l j=l 



i— m+1 j — 1 



J =1 



where A is a Lagrangian multiplier. Setting the derivative of equation (fTTj) with respect to Wj equal to 
zero and solving for ttj yields to 

m n 

E^+ E 

(g+1) _ i=l 



Si) 



z— m+1 



(18) 



With reference to the updated estimates of Kj, j = 1, . . . , fc, maximization of Q2 (k; ip^ J leads to 



m 



1 1 



E 



(g+i) 



ZijXi + Z ij' X i 

i—1 i— m+1 



E z y+ E ■ 

i—1 z— m+1 



T (9+l) 
T X\3 



jX^-mSJ 1 ') + E 



1/2 



•i— m-+l 



E^+ E - 

i—1 i— m+1 
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Finally, regarding the update estimates of ^ , j = 1, . . . , k, maximization of Q3 xp^^j , after some 
algebra, yields to 



0(9+1) 



A 



z— m+1 



2_-m+l 



i—m-\-l 



J2 Zi J+ E 



J2 z v+ E 



E z y + E 



..('/) 



£=m,+l 



/ = 1 



i__m+l 



i=m,+l 



E4 9) ^ + E z i?y* x j J2 Zi J yi 

i—1 i— m+1 i—1 2— m+1 



Z ij V* 5^ ^ij^i + Z if X J 



i—m-\-l 



V 



E^ + E 



y (9) 



i=l 



i=m+l 



E z ^ + E 



E z y + E 

i—1 i—m+l 
m n 

E/ Z ^ ^ 



y (9) 



E z y + E 



,(9) 



/ = l,...,r 



2 m + 1 



o(9+l) 



j— m+1 



(</) 



r (9+l) 



E z y+ E 

2—1 2— m+1 

/ m 

'E%(^-^? +1) 



E^ +1) — 



i__m+l 



/--I 



5>i+ E 



i— rn+l 



2__m+l 



E i 



'0(9+1) 



2 \ 



1/2 



V 



X> + E 



„(«) 



i—m-\-l 



J 



where the r-dimensional vector Xi is obtained from the Vandermondc vector Xi by deleting its first 
element. 

4.3 Some considerations 

In the following, the estimates obtained with the EM algorithm will be indicated with a hat. Thus, for 
example, ij) and 2y will denote, respectively, the estimates of ij) and Zij, i = m+1, . . . , n and j = 1, . . . , k. 
As said before, the fitted mixture model can be used to classify the n — m unlabeled observations via 
the MAP classification induced by 



MAP (%) = 



1 if max/, {zih} occurs at component j 
otherwise 



(19) 



i = m + 1, . . . , n and j = 1, . . . , k. Note that the MAP classification is used in the analyses of Section [7] 
As an alternative to th e EM algorithm, one could adopt the well-known Classification EM (CEM; 



Celeux and Govaert 



19921 ) algorithm, although it maximizes £ c (V0- The CEM algorithm is almost 



identical to the EM algorithm, except for a further step, named C step, considered between the standard 
E and M ones. In particular, in the C step the z\f in (fT2j) are substituted with MAP (z^), and the 



8 



obtained partition is used in the M-step. 



5 Computational issues 



Code for the EM algorit hm (as well as its CEM variant) described in Section [4] was written in the R 



computing environment (|R Development Core Team , 



2011) 



5.1 EM initialization 

Before running the EM algorithm, the choice of the starting values constitutes an imp ortant issue. The 
stand ard initialization consists in selecting a value for if>^ . An alternative approach (see ] 



McLachlan and Peel , 



2000L p. 54), more natural in the modeling frame described in Section [31 is to perform the first E-stcp by 
specifying, in equation (fl"2")) . the values o f z^°\ i = m + 1 ,.. . n, f or the unlabeled observatio ns. Among 



the possible initialization strategies (see 



according to the R-package f lexmix (jLeisch 



Biernacki et al 



2003 and 



Karlis and Xekalaki 



Grim and Leisch 



20031 for details) 



20081 ) which allows to estimate 



120041 and 

finite mixtures of polynomial Gaussian regressions - a random initialization is repeated t times from 
different random positions and the solution maximizing the observed-data log-likelihood among these t 
runs is selected. In each run, the n — m vectors zj ' are randomly drawn from a multinomial distribution 
with probabilities (1/fc, . . . , 1/fc). 



5.2 Convergence criterion 



The Aitken acceleration procedure ([Aitkenl . 1 1926T) is used to estimate the asymptotic maximum of the 
log-likelihood at each iteration of the EM algorithm. Based on this estimate, a decision can be made 
regarding whether or not the algorithm has reached convergence; that is, whether or not the log-likelihood 
is sufficiently close to its estimated asymptotic value. The Aitken acceleration at iteration k is given by 



,0) 



where l<- k+1 \ and Z^" 1 ' are the log-likelihood values from iteratio ns fc+ 1, k, and k — Ir respectively. 



Then, the asymptotic estimate of the log-likelihood at iteration k + 1 (jBohning et al 



19941) is given by 



j(fc+i) = Z (fc) + _J_ Uk+i) _ \ 

1 — aw V / 



In the analyses in Section [3 we follow 
with e = 0.05. 



McNicholas ( 2010 ) and stop our algorithms when Zm +1 ' ) — < e, 
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5.3 Standard errors of the estimates 



Once the EM algorithm is run, the covariance matrix of the estimated parameters t/> is determined by 
using the inverted negative Hessian matrix, as computed by the general purpose optimizer optim in the 
R-package stats, on the observed-data log-likelihood. In parti cular, optim is init ialized in the solution 



provided by the EM algorithm. The optimization method of iBvrd et al 
the po ssible options of the optim command. As underlined by 



Louis 



( 19951 ) is considered among 



19821) and 



Boldea and Magnus 



(|2009| ). among others, the complete-data log-likelihood could be considered, instead of the observed-data 
log-likelihood, in order to simplify the computations because of its form as a sum of logarithms rather 
than a logarithm of a sum. 



6 Model selection and performance evaluation 

The polynomial Gaussian CWM, in addition to if), is also characterized by the polynomial degree (r) and 
by the number of components k. So far, these quantities have been treated as a priori fixed. Nevertheless, 
for practical purposes, choosing a relevant model needs their choice. 



6.1 Bayesian information criterion and integrated completed likelihood 

A common way to select r and k consists in computing a convenient (likelihood-based) model selection 
criterion across a reasonable range of values for the couple (r, k) and then choosing the couple associated 
to the best value of the a dopted criterio n. Among the existing model selection criter ia, the Bayesian 



inform ation criterion (BIC; 



Schwarz, 



Biernacki et al. 



1978() and the integrated completed likelihood (ICL; 
2000h constitute the reference choices in the recent literature on mixture models. 

The BIC is commo nly used in model-based c lust ering and classifications applica tions involving a fam- 



ily of mixture models ([Fralev and Rafterv 



2002 



in mixture mode l selection was proposed 



to Bayes factors (jKass and Rafterv 



by 



and 



Daseupta and Rafterv 



McNichol as and Murphy 
~ J199J 



20081 ). The use of the BIC 



based on an approximation 



1995). In our context the BIC is given by 



BIC = 21 



(20) 



Lerouxl (199J) and 



Keribinl (|2000i ) present theoretical results that, under certain regulatory conditions, 
support the use of the BIC for the estimation of the number of components in a mixture model. 

One potential problem with using the BIC for model selection in model-based classification or clus- 
tering applications is that a mixture component does not necessarily correspond to a true cluster. For 
example, a cluster might be represented by two mixture components. In an attempt to focus model 
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selection on clusters rather than mixture components, IBiernacki et al\ (|2000D introduced the ICL. The 
ICL, or the approximate ICL to be precise, is just the BIC penalized for estimated mean entropy and, 
in the classification framework used herein, it is given by 



ICL w BIC + J2 MAP ^ ln ' : 

i=m-\-l j — 1 



(21) 



where X^"=m+i Sj=i MAP (zij) InSy is the estimated mean entropy which reflects the uncertainty in the 
classification of observation i into component j. Therefore, the ICL should be less likely, compared to 
the BIC, to split one clus ter into two mixture components, for example. 



Biernacki et al 



(200fj, p. 724), based on numerical experiments, suggest to adopt the BIC and the 



ICL for indirect and direct applications, respectively. 



6.2 Adjusted Rand index 

Although the data analyses of Section [7] are mainly conducted as clustering examples, the true clas 
sifications are actually k nown for these data. In these examples, the Adjusted Rand Index (ARI 



Hubert and Arabic 



Rand 



1985) is used to measure class agreement. The original Rand Index (RI 
19711 ) is based on pairwise comparisons and is obtained by dividing the number of pair agreements (ob- 
servations that should be in the same group and are, plus those that should not be in the same group 
and are not) by the total number of pairs. RI assumes values on [0,1], where indicates no pairwise 
agreements between the MAP classification and true group membership and 1 indicates perfect agree- 
ment. One criticism of RI is that its expected value is greater than 0, making smaller values difficult 
to interpret. ARI corrects RI for chance by allowing for the possibility that classification performed 
randomly should correctly classify some observations. Thus, ARI has an expected value of and perfect 
classification would result in a value of 1. 



7 Illustrative examples and considerations 

This section begins showing some relations between polynomial Gaussian CWM and related models. The 
section continues by looking at two applications on artificial and real data. 

Parameters estimation for finite mixtures of polynomial Gaussian regressions is carried out via the 
f lexmix function of the R-packagc f lexmix which, among other, allows to perform EM and CEM algo- 
rithms. If not otherwise stated, the number of repetitions, for the random initialization, will be fixed at 
t = 10 for all the considered models. 
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7.1 Preliminary notes 



Before to illustrate the applications of the polynomial Gaussian CWM on real and artificial data sets, it 
is useful to show some limit cases in a direct application of this model. 
Let 



P 



ft, 

(y \x;tt,k) = ^7Tj0 (y 



3=1 



(22) 



be the (conditional) density of a finite mixture of polynomial Gaussian regressions of Y on x. Moreover, 
let 

k 

p(x;tt,£) = ^2 -Kj4> \ x\ii, x \j,Gx\j) ( 23 ) 



j'=i 



be the (marginal) density of a finite mixture of Gaussian distributions for X . The following propositions 
gives two sufficient conditions under which the posterior probabilities of component-membership, arising 
from models f|22[) and (J23J) , respectively, coincide with those from the p olynomial Gau s sian C WM. 
Similar results for the linear Gaussian and the linear t CWM are given in 



Ingrassia et al 



Ingrassia et al 



(2012a 



and 



(I2012bf ). respectively. 



Proposition 1. Given k, r, tt and k, if ii X \x = • ■ • = Hx\k = t^x and o~x\\ = ■ ■ ■ = <Jx\k = a X> then 
model ([5]) generates the same posterior probabilities of component-membership of model (|22[1 . 

Proof. Given k, r, tt and k, if the component marginal densities of X do not depend from j, that is if 
Mxii = • • • = Hx\k = Ma" and o x \i = ■ ■ ■ = <J X \k = °~x, then the posterior probabilities of component- 
membership for the CWM in ([5]) can be written as 



P(Zij = l\k,r,-K,K,fix,crx) 



7Tj</> (yi x l ;n r {xuf3 3 ) , of. ) 4> (xi;fix,o- x ) 



3=1 



A, 

^J^i*X7^xlYl n ^ { Vi Xi] ( x i I Pj) > a l 3 
3=1 

7i"i</> (yi j^iSMr (^5 A,') >^ ) 



i = 1,. . . ,n, j = 1,. . .,k, 



3=1 



which coincide with the posterior probabilities associated to the model in ([22 



□ 

Proposition 2. Given k, r, 7r and if f3 1 = ■ ■ ■ = (3 k = (3 and o~ £l = ■ ■ ■ = o~ £k = o~ e , then model (O 
generates the same posterior probabilities of component-membership of model (|23|) . 
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Proof. Given k, r, n and if the component regression models does not depend from j, that is if 
(3 1 = ■ ■ ■ = (3 k = (3 and a ei = ■ ■ ■ = o~ £k = a e , then the posterior probabilities of component- membership 



for model (O can be written as 

P(Z iS = l\k,r,ir,£,0,* e ) 



2 



2 

A' \j 



y^TTj^ (yi \xi\Hr (xi;P) ,cxl) 4> (xi;iJ,x\j,o- 

3 = 1 




□ 



, i = l,...,n, j = 1, ...,k, 

i=i 

which coincide with the posterior probabilities of the model in (l23l) . 
7.2 Artificial data 

An artificial data set is here generated by a polynomial Gaussian CWM. One of the aims is to highlight 
a situation in which a finite mixture of polynomial Gaussian regressions provides a wrong classification 
although the underlying groups are both well-separated and characterized by a polynomial Gaussian 
relationship of F on i. 

The data consist of n = 700 bivariate observations randomly generated from a cubic (r = 3) Gaussian 
CWM with k = 2 groups having sizes ri\ = 400 and 712 = 300. Table ffl reports the parameters of the 
generating model. Simulated data are displayed in Figure [T] by what, from now on, will be simply named 

Table 1: Parameters of the cubic Gaussian CWM (k = 2) used to generate the data. 



(a) Regression parameters 



(b) Other parameters 





Component j = 1 


Component j = 2 




Component j = 1 


Component j = 2 


Ay 


0.000 


-8.000 




0.571 


0.429 


0y 


-1.000 


0.100 




-2.000 


3.800 


Ay 


0.000 


-0.100 




1.000 


0.700 


Ay 


0.100 


0.150 








1.600 


2.300 









as CW-plot. It allows to visualize the joint distribution of (X, Y) , the regression of Y on x, and the 
marginal distribution of X; as stressed from the beginning, these represent the key elements on which a 
CWM is based. A gray scale, and different line types (solid and dashed in this case), allow to distinguish 
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-6 -4 -2 2 4 6 

X 

Figure 1: CW-plot of the simulated data from a cubic Gaussian CWM {n = 700 and k = 2). 

the underlying groups in Figure [TJ The top of Figure [1] is dedicated to the marginal distribution of X. 
Here, we have an histogram of the simulated data on which are superimposed the component univariate 
Gaussian densities, multiplied by the corresponding weights 7Ti and tt2, and the resulting mixture. The 
scatter plot of the data is displayed at the bottom of Figure [T] The observations of the two groups 
are here differentiated by the labels 1 and 2 and by the different gray scales. The true cubic Gaussian 
regressions are also separately represented. The underlying (generating) joint density is also visualized 
via isodensitics; its 3D representation is displayed in Figure [5J 

Now, we will suppose to forget the true classifications Zi, i = 1, ...,n, and we will evaluate the 
performance of a finite mixture of cubic Gaussian regressions on these data. To make f lexmix in the 
best conditions, we have used the true classification as starting point in the EM algorithm and we have 
considered the true values of k and r. Nevertheless, without going into details about the estimated 
parameters, the clustering results are very bad, as confirmed by the scatter plot in Figure [3] and by a 
very low value of the adjusted Rand Index (AM = 0.088). 

On the contrary, as we shall see in a short time, the results obtained with the polynomial Gaussian 
CWM are optimal. In particular, differently from the previous model, in this case we preliminarily 
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Figure 2: Underlying joint density of the simulated data. 



estimate the number of mixture components k and the polynomial degree r according to what suggested 
in Section IrTTl Table [5] reports the values of 21 (^j , BIC, and ICL, obtained by using the EM algorithm, 
for a large enough number of couples (k,r), k = 1, . . . , 5 and r = 1, . . . , 5. Bold numbers highlight the 
best model according to each model selection criterion. It is interesting to note as BIC and ICL select the 
same (true) model characterized by k = 2 and r = 3. For this model, Table [3] shows the ML estimated 
parameters (and their standard errors in round brackets) while Figure H] displays the resulting scatter 
plot. Visibly, optimal results in terms of fit and clustering are obtained, as also corroborated by the 
value ARI = 1. 

Other experiments, whose results are not reported here for brevity's sake, have shown that a finite 
mixture of polynomial Gaussian regressions is not able to find the underlying group-structure when it 
also affects the marginal dis tribution of X. These results generalize to the case r > 1 the considerations 
that 



Ingrassia et al 



(I2012al ) make with reference to the linear Gaussian CWM in comparison with finite 



mixtures of linear Gaussian regressions. 



15 




Figure 3: Scatter plot of the artificial data with labels, and curves, arising from the ML-estimation 
(with the EM algorithm) of a finite mixture of k = 2 cubic Gaussian regressions. Plotting symbol and 
color for each observation is determined by the component with the maximum a posteriori probability. 



7.3 Real data 



The "places" data from the Places Rated Almanac (jSavageau and Loftus 



1985 ) are a collection of nine 



composite variables constructed for n — 329 metropolitan areas of the United States in order to measure 
the quality of life. There are k = 2 groups of places, small (group 1) and large (group 2), with the first 
of size 303. For the current purpose, we only use the two variables X = "health care and environment" 
and Y ="arts and cultural facilities" measu red so that the higher the sc ore, the better. These are the 



two variables having the highest correlation (jKopalle and Hoffman 



19921 ) . Figure [S] displays the scatter 



plot of X versus Y in both groups. In view of making clustering, the situation seems more complicated 
than the previous one due to a prominent overlapping between groups; this is an aspect that have to 
be taken into account in evaluating the quality of the clustering results. Furthermore, it is possible to 
see a clear parabolic functional relationship of Y on x in group 2. Regarding group 1, Table [4] shows 
the summary results of a polynomial regression (r — 5) as provided by the lm function of the R-packagc 
stats. At a (common) nominal level of 0.05, the only significant parameters appear to be those related 
to the parabolic function. 

Now, we will suppose to forget the true classification in small and large places and we will try to 
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Table 2: Values of 21 yipj , BIC and ICL for a polynomial Gaussian CWM. Different number of mixture 

components (k = 1, ...,5) and of polynomial degrees (r = 1,...,5) are considered. Bold numbers 
highlight the best values for BIC and ICL. 



(a) 21 ($) 



k 


\ r 


1 


2 


3 


4 


5 




1 


-7356.441 


-7317.690 


-6582.602 


-6552.363 


-6546.713 




2 


-5855.160 


-5653.833 


-5636.396 


-5636.224 


-5633.650 




3 


-5713.095 


-5634.769 


-5629.208 


-5584.800 


-5584.004 




4 


-5672.787 


-5618.283 


-5599.250 


-5549.372 


-5549.227 




5 


-5661.108 


-5616.798 


-5579.644 


-5548.913 


-5546.247 


(b) BIC 


k 


r 


1 


2 


3 


4 


5 




1 


-7382.645 


-7350.446 


-6621.908 


-6598.220 


-6599.122 




2 


-5914.120 


-5725.895 


-5721.560 


-5734.490 


-5745.019 




3 


-5804.810 


-5746.137 


-5760.229 


-5735.475 


-5754.332 




4 


-5797.258 


-5768.958 


-5776.129 


-5752.455 


-5778.515 




5 


-5818.334 


-5806.779 


-5802.380 


-5804.405 


-5834.495 


(c) ICL 


k 


T 


1 


2 


3 


4 


5 




1 


-7382.645 


-7350.446 


-6621.908 


-6598.220 


-6599.122 




2 


-5914.120 


-5726.044 


-5721.563 


-5734.492 


-5745.198 




3 


-5847.224 


-5747.712 


-5873.706 


-5740.202 


-5754.593 




4 


-5906.636 


-5914.615 


-5783.383 


-5759.843 


-5790.514 




5 


-5923.268 


-5967.415 


-5851.145 


-5965.444 


-5876.014 



Table 3: 
Gaussian 



Maximum likelihood estimated parameters, obtained with the EM algorithm, for a cubic 
CWM (k = 2). Standard Errors are displayed in round brackets. 



hi 
hi 
%i 



(a) Regression parameters 



(b) Other parameters 



'0j 



Component j = 1 Component j ' = 2 



Component j = 1 Component j = 2 



0.030 


-8.042 


*3 


0.571 


0.429 


(0.270) 


(10.391) 


(0.028) 


(0.024) 


-1.004 


-0.090 




-2.046 


3.814 


(0.398) 


(8.558) 


V-X\i 


(0.051) 


(0.039) 


0.041 


-0.038 




1.022 


0.682 


(0.194) 


(2.304) 




(0.036) 


(0.027) 


0.114 


0.147 








(0.028) 


(0.202) 








1.586 


2.502 








(0.056) 


(0.102) 
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Figure 4: Scatter plot of the artificial data with labels, and curves, arising from the ML-estimation 
(with the EM algorithm) of a cubic Gaussian CWM with k = 2. 



Table 4: Estimated parameters, and corresponding summary statistics, of a polynomial regression (of 
degree r = 5) fitted on group 1 via the R-function lm. 





Estimate 


Std. error 


i-value 


p- value 


An 


0.256 


0.705 


0.363 


0.71687 


0u 


4.024 


3.215 


1.251 


0.21176 


021 


-7.460 


5.033 


-1.482 


0.13937 


031 


7.580 


3.373 


2.247 


0.02537 


041 


-2.762 


0.993 


-2.782 


0.00574 


051 


0.328 


0.104 


3.141 


0.00185 



estimate it by directly considering the case k = 2. Figure [5] displays the values of 21 yip J , BIC, and ICL 
for the polynomial Gaussian CWM in correspondence of r ranging from 1 to 8. The best polynomial 
degree, according to the BIC (-1874.148) and the ICL (-1894.629), is r = 2. This result corroborates 
the above considerations. Table [5] summarizes the parameter estimates, and the corresponding standard 
errors, for the quadratic Gaussian CWM with k = 2. The CW-plot is displayed in Figure [71 The ARI 
results to be 0.208; the corresponding value for a finite mixture of k = 2 quadratic Gaussian regressions 
is 0.146. Note that, the ARI increases up to 0.235 if the quadratic Gaussian CWM is fitted via the CEM 
algorithm. 



18 




Figure 5: Scatter plot of X = "health care and environment" and Y ="arts and cultural facilities" for 
303 s mall (denoted with 1) and 2 6 large (denoted with 2) metropolitan areas of the United States. Data 

~~ 3 ( 



from Savagcau and Loftus ( 198 




polynomial degree - r 



Figure 6: Values of 21 ( ip ) , BIC and ICL, in correspondence of r = 1, . . . , 8, in the polynomial Gaussian 
CWM. 
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Table 5: Parameters of a quadratic Gaussian CWM estimated with the EM algorithm [k 
errors are displayed in round brackets. 

(a) Regression parameters 



2). Standard 



Ay 
hi 
Ay 



(b) Other parameters 



Component j = 1 Component j ' = 2 



0.677 


4.503 


(0.349) 


(0.877) 


0.191 


-1.821 


(1.081) 


(0.592) 


0.892 


1.019 


(0.787) 


(0.084) 


0.866 


2.220 


(0.058) 


(0.159) 



Component j = 1 Component j = 2 





0.662 


0.338 




(0.050) 


(0.039) 




0.699 


2.138 




(0.024) 


(0.136) 




0.294 


1.194 




(0.019) 


(0.082) 




===^ 2 2»*2 2-' 
2 2 222 2 , aOF- - ' 2 



2, ' 
- ' 2 



Figure 7: CW-plot of the quadratic Gaussian CWM fitted, via the EM algorithm, on the "places" 
data (k = 2). Plotting symbol and color for each observation is determined by the component with the 
maximum a posteriori probability. 



7.3.1 Classification evaluation 



Now, suppose to be interested in evaluating the impact of possible m labeled data, with m < n, on the 
classification of the remaining n — m unlabeled observations. With this aim, using the same data, we 
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have performed a simple simulation study with the following scheme. For each to ranging from 1 to 
250 (250/329=0.760), we have randomly generated 500 vectors, of size to, with elements indicating the 
observations to consider as labeled (using the true labels for them). The discrete uniform distribution, 
taking values on the set {1, . . . , n}, was used as generating model. In each of the 500 replications, the 
quadratic Gaussian CWM (with k = 2) was fitted, with the EM algorithm, to classify the n — m unlabeled 
observations. Figure [5] shows the average ARI values, computed across the 500 replications only on the 
n — m unlabeled observations, for each considered value of to. To give an idea of the conditional variability 



< 




50 100 150 200 

m 



Figure 8: Average ARI values, with respect to 500 replications, for each considered value of to (central 
line with the highest width). The simulated quantiles of probability 0.025, 0.05, 0.95 and 0.975, are also 
superimposed. 

of the obtained results, quantiles of probability 0.025, 0.05, 0.95 and 0.975, are also superimposed on the 
plot. As expected, the knowledge of the true labels for m observations tends (on average, as measured 
by the ARI) to improve the quality of the classification for the remaining observations when m increases. 

8 Discussion and future work 

An extension of the linear Gaussian CWM has been introduced which allows to model possible nonlinear 
relationships in each mixture component through a polynomial function. This model, named polynomial 
Gaussian CWM, was justified on the ground of density estimation, model-based clustering, and model- 
based classification. Parameter estimation was carried out within the EM algorithm framework, and the 
BIC and the ICL were used for model selection. Theoretical arguments were also given showing as the 
posterior probabilities of component-membership arising from some well-known mixture models can be 
obtained by conveniently constrained the parameters of the polynomial Gaussian CWM. The proposed 
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model was then applied on a real and an artificial data set. Here, excellent clustering/classification 
performance was achieved when compared with existing model-based clustering techniques. 

Future work will follow several avenues. To begin, extension of the polynomial Gaussian CWM, to 
more than two variables, will be considered. This extension could initially concern only the X variable 
and then involve also the Y one. Afterwards, the polynomial t CWM could be introduced by simply 
substituting the Gaussian density with a Student-t distri bution providing more robust inference for 



data characterized by noise and outliers (jLange et al 



19891 ). Finally, it could be interesting to evaluate, 



theoretically or by simulatio n, the most convenient model selection criteria for the proposed model. Note 



that this search, in line with 



Biernacki et al 



(2000, P- 724), could be separately conducted according to 



the type of application, direct or indirect, of the model. 
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